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We the identification of a novel coronavirus associated with the severe acute respiratory syndrome (SARS), 
computational analysis of its RNA genome sequence is expected to give useful clues to help elucidate 
the origin, evolution, and pathogenicity of the virus. In this paper, we study the collective counts of palin- 
dromes in the SARS genome along with all the completely sequenced coronaviruses. Based on a Markov-chain 
model for the genome sequence, the mean and standard deviation for the number of palindromes at or above 
a given length are derived. These theoretical results are complemented by extensive simulations to provide 
empirical estimates. Using a z score obtained from these mathematical and empirical means and standard 
deviations, we have observed that palindromes of length four are significantly underrepresented in all the coro- 
naviruses in our data set. In contrast, length-six palindromes are significantly underrepresented only in the 
SARS coronavirus. Two other features are unique to the SARS sequence. First, there is a length-22 palindrome 
TCTTTAACAAGCTTGITAAAGA spanning positions 25962-25983. Second, there are two repeating length-12 palindromes 
TTATAATTATAA spanning positions 22712-22723 and 22796-22807. Some further investigations into possible bio- 


logical implications of these palindrome features are proposed. 
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1. Introduction 

In March 2003, a novel coronavirus associated with 
the severe acute respiratory syndrome (SARS) was iden- 
tified. The outbreak of SARS in different parts of 
the world, causing hundreds of deaths, has initiated 
much international effort that includes clinical, epi- 
demiologic, and laboratory investigations with the 
aim of controlling the spread of the virus (Bloom 2003, 
Marra et al. 2003, Ruan et al. 2003, Rota et al. 2003). 
Although the world was cleared of new SARS cases 
by July 2003, the pursuit for a thorough understand- 
ing of the origin, evolution, and pathogenicity of this 
deadly virus continues. 

With the availability of the complete genome 
sequence of the SARS and several other coronaviruses 
in public databases (e.g., GenBank), it is possible to do 
a computational analysis of the viral genome, looking 
for unusual genome sequence features either unique 
to the SARS virus or common to the coronavirus 
family. Such information can give clues to the ori- 
gin, natural reservoir, and evolution of the virus. It 


331 


may contribute to the studies of the immune response 
to this virus and the pathogenesis of SARS-related 
disease (Rota et al. 2003). 

Statistical and experimental studies of palindromes 
in the other classes of viral genomes, such as the 
double stranded DNA viruses, bacteriophages, retro- 
viruses, etc., have been performed (Cain et al. 2001, 
Dirac et al. 2002, Hill et al. 2003, Karlin et al. 1992, 
Leung et al. 2002, Rocha et al. 2001, among others). 
These studies have suggested that palindromes might 
be involved in the viral packaging, replication, 
and defense mechanisms. Unlike these well-studied 
viruses involved in fatal diseases such as AIDS and 
various cancers, the coronaviruses have not received 
as much attention until the recent outbreak of SARS. 

In the present study, we focus our attention on 
palindromes in the positive-stranded RNA genomes 
of coronaviruses. In accordance with GenBank con- 
vention, we represent an RNA sequence as a string 
of letters from the alphabet “ = {A,C,G,T}. The four 
letters respectively stand for the RNA bases adenine, 
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cytosine, guanine, and uracil. The letters A and T are 
complementary to each other because adenine and 
uracil form hydrogen bonds with each other. The 
same applies to C and G. A palindrome is a symmet- 
rical word such that when it is read in the reverse 
direction, it is exactly the complement of itself. For 
example, ACGT is a palindrome of length four. A palin- 
drome is necessarily even in length because the mid- 
dle base in any odd-length nucleotide string cannot 
be identical to its complement. 

Several points are worth noting from this initial 
exploratory analysis of palindromes in the corona- 
virus genome sequences: (1) The palindrome counts 
in the coronavirus genomes seem lower than what 
would be expected from random sequences. (2) The 
SARS virus contains an exceptionally long palin- 
drome with 22 nucleotide bases. This is the longest 
among all palindromes observed in the coronaviruses. 
(3) There are two copies of a length-12 palindrome 
situated within 100 bases of each other in the 
SARS genome. This is not observed in the other 
coronaviruses. 

Whether or not these palindrome-related features 
have any biological relevance will, of course, have to 
rely on careful laboratory investigations by the virol- 
ogists. At this stage, however, it would be only rea- 
sonable to assess whether these features can indeed 
be considered statistically unusual when compared 
to random-sequence models. Our observations call 
for investigations into the probability distributions 
of palindrome counts, lengths, and locations in a 
random sequence. This paper will focus only on 
the palindrome counts, leaving the others for future 
studies. 

In the next section, the mathematical formulas for 
the theoretical mean and variance for the number 
of palindromes at or above a prescribed length are 
derived based on a Markov-chain random-sequence 
model. Section 3 summarizes the computational 
results in comparing palindrome counts of the coro- 
navirus genomes to the random-sequence models. 
In §4, we propose some biological questions that may 
be investigated in relation to these observed nonran- 
dom features. A few concluding remarks are given 
in §5. 


2. Palindrome Counts in 
Markov-Chain Models 


The main objective of this paper is to assess whether 
the palindrome counts in the coronavirus genomes 
are observed more (or less) frequently than expected, 
under some specified probability models. We model 
the genome sequence as a realization of a sequence of 
random variables €,, &,...,€, taking values in 4 = 
{A,C,G,T} and n is the genome length. Throughout, 
we will assume that either 


(i) {€:, &,...,&,} are independent and identically 
distributed (MO); or 

(ii) {€,, &,...,€,} form a stationary Markov chain 
of order one (M1). 

For studying DNA words of length k, one can 
choose to use Markov chains of order up to the maxi- 
mum order of k — 2 as the sequence model. A higher- 
order Markov chain will better fit the data sequence, 
but at the same time the number of parameters in 
the model increases exponentially. In this study, we 
carried out some simulations using the second-order 
Markov-chain model (M2). The computation takes 
much longer, but the z scores obtained gave the same 
interpretation as that of the M1 model. We therefore 
content ourselves with the MO and M1 models for our 
analysis of palindromes of length four and above. 

We are interested in deriving the mean and stan- 
dard deviation of the random variable X,, total num- 
ber of palindromes of length at least 2L under the MO 
and M1 sequence models. This will help quantify 
the extent of deviation of the observed palindrome 
counts in the coronavirus genome from the expected 
counts under the specified probability model. For 
L<k<n-—L, define 


1 if the kth base is the left center of a 
I, = palindrome of length > 2L 


0 otherwise 


We say that a palindrome occurs at k when I, =1. 
Therefore, X, = oP]. Note that the distribution 
of I, depends only on the joint distribution of 
(€_t41,+-++, &41)- Under the MO or M1 model, the 
joint distribution of (&_741,--.,&€41) is independent 
of k. Hence P[I, = 1] is a constant in k. Similarly 
P[I; =1, I, =1] depends only on |j —k|. Therefore, for 
L<k<n—Land1<d<n—L-—k, we define 


y(0):=P[,=1] and y(d@d):=P[,=1, hy,=1). 

The expressions of y(0) and y(d) are crucial to cal- 
culating the mean and variance of X, (see Proposi- 
tion 3 below). Lemma 1 (respectively, Lemma 2) deals 
with the computation of y(0) and y(d) under the M1 
(respectively, MO) sequence model. Indeed, we will 
deduce Lemma 2 from Lemma 1. 

Throughout, we use b’ to denote the complemen- 
tary base of b, and w’ the inversion (i.e., the comple- 
mentary word read in reverse) of the word w. There 
are quite a few details to work out all the possible 
overlap cases because the overlap structures depend 
on the relative sizes of d (the extent of overlap) and 2L 
(the cutoff length of a palindrome). However, there 
are only two basic patterns in the overlap. In the 
first pattern (as illustrated by Figure 1b), the shaded 
segment, due to the complimentary requirement of 
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palindrome C, 4 OL > 
a c b 
< 2L > palindrome Cx+a 
(a) d > 2L. Here the palindromes C, and C,,, do not overlap and c denotes 
the segment between them. 
palindrome C, 
< 2L > 
w' u' | u w 
< d > w v | v' w' 
+ 2L-d—> 
palindrome Cx+a 
(b) L<d<2L. Here w denotes the common segment of palindromes C, and 
Cyiqg. And w determines the left end and right end of C, and C;,4. 
palindrome Cx 
< L > 
v w' w w' Ww w' w v' 
t+d— | u' w w' Ww w' Ww w' u 
<«— d—»<« gd ><> 
palindrome Cx+a 
(c) 1<d<L with q as quotient when L is divided by d and r the remainder. 
The shaded segment determines the rest of both palindromes 
Figure 1 Overlapping Structures of Palindromes C, and C,,, for Different Values of d 
Note. (a), (b), and (c) are drawn with different scales. 
a palindrome, will uniquely determine the left and (ii) L<d <2L: 
right ends of C, and C,,,. And in the other pattern gig 
(as illustrated by Figure 1c), the shaded segment will y@= > a(b,) P(b,, b,)P(b,, b',) Il P(b;, By) 
determine the rest of both palindromes. In Figure 1a, By y wees Bgest j=l 
even though palindromes C, and C,,, do not actually Ae 


overlap (i.e., d > 2L), the occurrence of a palindrome 
at k will still have an effect on the probability that a 
palindrome will occur at k+d under the M1 sequence 
model. Lemma 1 provides expressions of y(d) under 
all possible situations. 


LEMMA 1. Suppose the genome sequence is modeled as 
a stationary Markov chain of order one with stationary dis- 
tribution w := (w7(A), a7(C), 7(G), 7(T)). For a,be A 
and m>1, let P(a,b) and P' (a,b) respectively denote 
the transition probability and the m-step transition proba- 
bility from base a to base b. 


(a) We have 
L-1 
yO)= YY (by) P(b., b,) TPG), bj41) PG}, 8) 
by, ..., bbEd j=l 


(1) 
(b) For d>1, we have the following three cases: 
(i) d>2L: 


yd) = So m(a,)P(a,, a,)P(b,, b,) PO "*” (ai, by) 
“nen 
= 
: TT[P(;. A44)P(ais1, a) P(b;, Dis )P (Bir, b')]. 


jal 


TTP Gu 


bi) P(By tats , bi 141) : 
l=1 


(iii) 1<d <L: we let L=qd+r. 


w=] qt+1 
ya= YS Ky Cpundy) [Pb ts) 1a) 
by,...,bgE4 j=l 
d-1 q 
. [PU b,) TI P(;, b.)| 
j=l 
where 
Ky aire 7 By) 
ar (Ba-o41)P (BBs) THC PCO), Baa) 
‘ i eee P(b;, bi41) r>2 
7 (Bjri)P(b4, bi) r=1 
a (0) /P (ba, 04) r=0 


Proor. (a) Note that a palindrome of length at least 
2L is of the form b,---b,b,---b; where b,,...,b, € A. 
Therefore, 


y(0) = 


se 


by, ..., bbe 


P[b,---b,b, --- bi]. 
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Because form vww::-www'u. Therefore, 
L-1 : q 
P[b, ---b,b, ---;] = w(b)] TP, Bn) y(d) = Sy P[Bargt 00+ Bg By - +  BiDy Bye 
i= By, ., bg esd es 
Pte. ba [TTP ay ¥) DUD Deal Deeb, (2) 
———_——_—_————— 


(1) follows immediately after rearranging terms. 

(b) To compute the overlap probability y(d), i-e., the 
probability that there are palindromes at k and k+d, 
we call the stretch of bases &_),1--+&4.9;, the span of 
palindromes C, and C,,4. 

For (i) d > 2L: The span s of the two palindromes C, 
and C,,, is of the form acb where a= 4, ---4,a,--+- a, 


C=C,-++Cy_»,, and b=b,---b,b, ---b;. Hence, 
y(d) = XY Pls] = 7 > Pla] P[eb, | a, ] PLb |b] 
a,c,b a,b c 
= VPlalPe**” (ai, by) P[b | by]. 
a,b 


Hence (i) follows immediately from 
L-1 L-1 

Pla} = m(0)] [1 PCa 9)1)| P(a.-ai)| FT PCa) 
j=l j=l 


and 


L-1 


P[b | b,] = [TI P(b,, bs) |PCb, [1 PBs, ¥) 


For (ii) L < d < 2L: Refer to Figure 1(b), let w = 
bi-t41°:: 6, denote the common segment of palin- 
dromes C, and C,,;. Assuming d > L, let u= D, --- by_;, 
and v=0,,,---b,; we can represent C, = w'u'uw and 
C.4.q = wvv'w' where 0,,...,0; € 4. Therefore, 


yd@)= >> P[w'u'uww'w’] 
by, «., bgesd 
= DO PlbL bby babi By]. 
by, ..., bge 


Writing it out in terms of the initial distribution and 
transition probabilities, we have proved (ii) for d > L. 
The case for d= L is similar: Take u and v as null 
words and proceed as in the case d > L. 

To prove (iii), we consider the case r > 1 first. 
This time, let w = b,---b,; denote the first d bases to 
the right of the center of C, and to the left of the 
center of C,,,. Let u=0,---b, and v= by_,4,-+-Dy, 
respectively denote the first and last r bases of w. 
Figure 1(c) displays the necessary structure in C, and 
Cy. for both of them to be palindromes when g=3. 
If gq is odd, then the span of C, and C,,, is of the 


q 


If q is even, then the span of C, and C,,, is changed 
accordingly to the form u’ ww’ --- ww’ wv’ and 
eee 


1 q 
y(d) = > P[b)---b, by +++ bybi +b, + 
;: >" S—Ss— 
1 
Ee 
—$<—S SS 


q 


bi wil (3) 


By making the one-to-one transformation in the sum- 
mation, b, > b',,...,b; > b;, and we can see that both 
sums on the RHS of (2) and (3) are the same. So with- 
out loss of generality, we compute y(d) under the 
assumption that q is odd. The crucial step is then to 
calculate the probability of the span of C, and C,4, 
and part (iii) will follow immediately from summing 
over all possible b,,...,b,. We first consider r > 2, 
then 
od eee OF Pe 
1 
bi --- Bib, --+b, bi --- bib, ---b, | 
a 
q 


r—-1 
= (By 41)P(B,b {TT PO, bx) 


j=l 


| Tl Pb by) |[ Pte ti) FL bin.6)] 


jad—r41 j=l 
d-1 
c [P(.2 1) u Pb, ba] (4) 


For r= 1, (4) becomes 
P[b, bi, -++ bpd, «Bg e+ Bye BLD, «+ By By 
—— — 


1 q 


d-1 q+1 
= m(b,)P(b;,, by) Ee b,) TL Ps, ¥)| 


b,b,] 


[Pei8 HTTP, bad) 


If r=0, reasoning similar to the above leads us to 
consider just the case q is odd. However, the span 
of C, and C,,,; becomes (one can take u and v as 
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empty words) w'w---w ww. And hence, 


1 q 


P[Bj-+-biby-+-Bg-+ Bis BLD, + bgbiy D4] 
a 


1 4 
77 (b') ! bate ! / [ 
= ——*_| P(b,,b P(b'_.,,b 
coacall ( d/ TTP ( jt+l’ A) 


; [P(.8 H)TTPO, bad] O 


Under the MO model, the stationary distribution 
T = (Pa, Pc, Pc, Pr), and the transition probabilities 
P(a,b) =p, and P™ (a,b) =p, for any a,be A, m>1. 
Substituting these into Lemma 1(a) and (i) and (ii) of 
Lemma 1(b) immediately gives us the corresponding 
parts in Lemma 2 below. Part (iii) of Lemma 1(b) can 
be simplified further according to how big the remain- 
der r is in relation to d. We shall omit the details. In 
this way, we have deduced the following Lemma 2, 
which was first proved in Leung et al. (2002). 


LEMMA 2. Suppose the genome sequence is modeled 
as MO and let 


6 :=2(paPr + PcPc)- 


(a) We have 
y(0) = 6". 


(b) For d>1, we have the following four cases: 
(i) d>2L: 
y(d) = 0"; 


(ii) L<d<2L: 
y(d) = 


when 1<d<L we let L=qd+r where 0 <r<d, and 
consider two subcases according to how big the remainder r 
is in relation to d. 

(iii) 1<d<Land0<r<(d+1)/2: 


y(d) = [2((Papr)™* + Pepe) )]" 
: [(PaPr) (Pa + Pr) + (cP) (Pe + po) |"? ; 
(iv) 1<d<Land (d+1)/2<r<d: 
y(d) = [2((papr) + Pepa) PO 


-[(@apr)™* (pa + Pr) + PePc)* (Pe + Pc) 


| ame 


Pp spr(Pa t+ Pr) + PcPcPc + Pc) 


i eae 


Proposition 3. With the I,’s as defined at the begin- 
ning of §2, the total number of palindromes of length 
at least 2L is given by X,:= ty I,. And hence, 


A, = E(X,;) = (n—2L +1)y(0) 


and 


a; := Var(X,) = (n—2L+1)y(0)(1 — y(0)) 
n—2L 


+2 9) (n—2L+1—d)[y(d) — y(0)], 
d=1 


where y(0) and y(d) are given as in Lemma 2 under 
the MO sequence model, and Lemma 1 under M1 sequence 
model. 


Proor. The first equation follows immediately from 
taking expectations on both sides of X,; := joy I, and 


f= Dventy42" > Cov(I;, I.) 
j=l k=j+1 
= ; — 2L+1)y(0)(1— y(0)) 
n—L—1 n—L-j 
+200 DY [PU =1, Tina =11- vO] 
j=l d= 
= (n—2L+1)y(0)(1— y(0)) 
n—2L 
+2 > (n—2L+1-—-d)[y(d)— y(0)’]. O 
d=1 


3. Palindrome Counts in 


Coronaviruses 

The derived means and variances under the MO 
and M1 sequence models enable us to assess whether 
the observed palindrome count in a genome is too 
abundant or rare. The z score defined in (5) below is a 
modification of a generally accepted measure of over 
(or under)representation of a DNA word. For L > 2, a 
standardized frequency under the assumption of the 
M1 sequence model is defined as 

Zu = Xi = Hot (5) 

om 

where X, is the observed number of palindromes 
of length at least 2L, and py, and oy, denote its 
expected value and standard deviation, respectively. 
(For simplicity, we do not indicate the dependence 
of w and o on L.) The corresponding z score is defined 
similarly for the MO sequence model. When L is small 
compared with the genome length n, X; is a sum 
of weakly dependent random indicators [, and it is 
therefore well approximated by a normal distribution. 
Indeed, if we let x denote the number of occur- 
rences of the jth palindrome in the genome, then 
the count vector (X(”,X,..., x) will converge 
to a multivariate normal distribution as 1 — oo (see 
Theorem a 5 in Waterman 1995). And hence X; = 
Vi<jcst x will converge to a normal distribution as 
nN — oo. Foe L=2 or 3, and n in the range 30,000, 
we expect that the distribution of the z scores will 
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Figure 2 Normal Q-Q Plots of Counts of Palindromes of Length Four 


(Top) and Six (Bottom) in the 1,000 Random Sequences 
Under the M1 Model for the SARS Genome 


be approximately standard normal. The near-straight 
lines in the Q-Q plots in Figure 2 confirmed that 
this is the case. This motivates our definition: The 
count is said to be over (or under)represented, if the z 
score is greater than 1.645 or less than —1.645, respec- 
tively (ie., in the upper or lower 5% of a standard 
normal distribution, as commonly used in one-tailed 
hypothesis tests in biological experiments). However, 
it should be emphasized that these cutoff z score val- 
ues can only be considered as a convenient statistical 
guideline to help bring out interesting observations 


rather than a strict criterion to lead to a definitive 
conclusion. 

We compute the z scores of the genomes in the fol- 
lowing data set: It is composed of seven coronaviruses 
with complete genome sequences and four other 
RNA viruses. For some coronaviruses, the genome 
sequences of multiple strains of the same virus are 
available. Only one strain is included in our data 
set because their genomes are very similar. Four 
other RNA viruses outside the coronavirus family are 
included in the data set. Two of these (the rubella 
virus and the equine arteritis virus) have positive- 
stranded RNA genomes like the coronaviruses, one 
(rabies virus) has a negative-stranded RNA genome, 
and the remaining one (HIV) is a retrovirus. Table 1 
lists the names of the viruses, abbreviations, GenBank 
accession numbers, genome lengths, and base compo- 
sitions of the seven coronaviruses and the other four 
RNA viruses. Table 2 displays the z scores for counts 
of palindromes of length four and above under the 
MO and M1 models. 

Table 2 indicates that there is a general avoidance 
of palindromes of length four and above in the coro- 
navirus genomes. A natural question that follows is 
whether palindromes of a given exact length are also 
underrepresented in these viruses. 

To answer this question, one would need the 
mean v and standard deviation 7 for the count Y, of 
palindromes of exact length 2L. It is easy to obtain 
the mean because v = E(Y,) = E(X,) — E(X,,,). The 
standard deviation of Y, can be derived with suit- 
able modification of the method of proofs in Lem- 
mas 1 and 2, but the expression obtained is rather 
lengthy due to an increase in the overlapping struc- 
tures. Instead, we adopt an alternative approach to 
estimate the standard deviation by simulation, which 
at the same time serves to validate our derived means 
and standard deviations. This approach has a fur- 
ther advantage of giving us the empirical distribu- 
tions, and Figure 2 shows that for small values of L, 
the distributions are well approximated by normal 
distributions. 


Table 1 List of Seven Coronaviruses and Four Other RNA Viruses to be Analyzed 

Name Abbrev. Accession Length Base composition 
SARS coronavirus Urbani SARS AY278741 29,727 (0.28, 0.20, 0.21, 0.31) 
Avian infectious bronchitis virus AIBV NC_001451. 27,608 (0.29, 0.16, 0.22, 0.33) 
Bovine coronavirus BCoV NC_003045. 31,028 (0.27, 0.15, 0.22, 0.36) 
Human coronavirus 229E HCoV NC_002645. 27,317 (0.27, 0.17, 0.22, 0.35) 
Murine hepatitis virus MHV NC_001846 31,357 (0.26, 0.18, 0.24, 0.32) 
Porcine epidemic diarrhea virus PEDV NC_003436. 28,033 (0.25, 0.19, 0.23, 0.33) 
Transmissible gastroenteritis virus TGV NC_002306.2 28,586 (0.29, 0.17, 0.21, 0.33) 
Rubella virus RUV NC_001545. 9,755 (0.15, 0.39, 0.31, 0.15) 
Equine arteritis virus EAV NC_002532.2 12,704 (0.21, 0.26, 0.26, 0.27) 
Rabies virus RV NC_001542. 11,932 (0.29, 0.22, 0.23, 0.26) 
Human immunodeficiency virus 1 HIV-1 NC_001802. 9,181 (0.36, 0.18, 0.24, 0.22) 
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Table 2 —_z Scores for Counts of Palindromes of Length Four and Above 
Virus Counts Pro (Sua) Har (1) Zo Zu 

SARS 1,554 1,981.0 (43.4) 1,687.6 (40.3) —9.83 —3.32 
AIBV = 1,578 =~ 1,896.6 (42.8) 1,675.3 (38.2) —7.45 —2.54 
BCoV = 1,886 = 2,115.6 (45.4) 2,007.5 (45.5) -5.06 —2.67 
HCoV 1,451 1,843.6 (42.2) 1,567.6 (37.0) —9.30 —3.15 
MHV = 1,793 2,006.6 (43.8) 1,911.3 (41.4) -4.88  —2.86 
PEDV 1,457 ~—- 1,781.6 (41.2) 1,578.8 (38.3)  —7.87 —3.18 
TGV 1,610 1,993.9 (43.8) 1,695.6 (38.9)  —8.76 —2.20 
RUV 868 793.2 (28.0) 845.6 (28.3) 2.67 0.79 
EAV 672 784.3 (27.2) 710.4 (25.8) —4.13 —1.49 
RV 559 758.0 (26.7) 564.3 (23.0) —7.45 —0.23 
HIV-1 475 591.9 (23.1) 480.2 (21.9) -3.33  —0.24 


For each virus in Table 1, 1,000 random sequences 
were generated for both the MO and M1 models 
using scripts written in the R language (http://www. 
r-project.org/). The sequences are run through the 
palindrome program which is part of EMBOSS 
(European Molecular Biology Open Software Suite, 
Rice et al. 2000) to extract the palindrome positions 
and length. Each output is then read by R again 
and the counts of palindromes of various length are 
tabulated. 

Tables 3 and 4 present the counts of palindromes 
of exact length four, six, and eight, along with their 
expected values v, estimated standard deviations 7, 
and z scores. Based on the z scores, Tables 3 and 4 
indicate that length-four palindromes are significantly 
underrepresented across the coronavirus family under 
both the MO and M1 sequence models. However, for 
length-six palindromes, SARS is the only member of 
the coronavirus family that shows underrepresenta- 
tion under the M1 sequence model. For length eight 
or above, no distinct patterns are observed. 

For palindromes of length four and above, it is pos- 
sible to fit higher-order Markov models to the genome 
sequence. For example, the second-order Markov- 
chain model that takes the base, dinucleotide, as well 
as trinucleotide composition into account, can be used 


to calculate the z scores. We simulated 1,000 random 
sequences with the M2 model, but the results did not 
differ much from the M1 model. 

As the EMBOSS palindrome program provides us 
with a detailed listing of all occurrences of palin- 
dromes of length four and above, we are able to 
notice two unique features in SARS. First, the SARS 
sequence contains a long palindrome of length 22, the 
longest among all palindromes observed in the coro- 
naviruses. Second, there are two identical, length-12 
palindromes situated within 100 bases of each other 
in the SARS genome. These are not observed in the 
other coronaviruses. Although contributing little to 
the total palindrome counts, these three palindromes 
appear unusual enough to warrant further study of 
their possible biological roles, as discussed in the next 
section. 


4, Discussion 

Various statistical assessments of unusual abundance 
and rarity of individual words, including individ- 
ual palindromes, in nucleotide sequences have been 
done using random-sequence models in a number of 
previous studies (Karlin et al. 1992; Merkl and Fritz 
1996; Rocha et al. 1998, 2001; Schbath et al. 1995, to 
name just a few). The present study, however, aims 
at investigating the unusual abundance and rarity of 
palindromes collectively rather than individually. The 
mathematical results in §2 provide a directly com- 
putable formula to give a single z score for all palin- 
dromes with a given minimal length. We hope the 
exploratory results in this paper will serve as a basis 
for more detailed investigations to see how palin- 
dromes might be involved in important biological 
mechanisms of the coronaviruses. 

There are two random sequence models MO and M1 
used in this paper. Because M1 can take the genome 
dinucleotide compositions into consideration while 
MO cannot, M1 is preferred over MO. Comparatively, 
the z scores under M1 are less extreme than those 


Table 3 Z Scores for Palindromes of Various Lengths Under the MO Model 
Length-four palindromes Length-six palindromes Length-eight palindromes 

Counts Yo (mo) Zuo Counts Yo (To) Zo Counts Yio (Fo) Zuo 
SARS 1,144 1,469.6 (36.9) —8.82 284 379.4 (19.4) —4.92 90 97.9 (9.7) —0.82 
AIBV 1,142 1,399.5 (37.5) —6.87 320 366.8 (18.6) —2.52 91 96.1 (9.9) —0.52 
BCoV 1,360 1,563.2 (40.4) —5.03 389 408.2 (20.4) —0.94 98 106.6 (10.7) —0.80 
HCoV 1,054 1,364.7 (36.9) —8.42 287 354.5 (18.9) —3.57 82 92.1 (9.8) —1.03 
MHV 1,328 1,499.0 (38.0) —4.50 340 379.2 (19.5) —2.01 82 95.9 (9.9) —1.41 
PEDV 1,079 1,332.5 (36.5) —6.94 274 335.9 (18.5) —3.35 79 84.7 (9.2) —0.62 
TGV 1,180 1,467.3 (38.4) —7.48 306 387.5 (19.7) —4.14 85 102.3 (9.8) —1.77 
RUV 610 567.0 (22.8) 1.89 167 161.7 (12.6) 0.42 68 46.1 (6.9) 3.17 
EAV 479 589.4 (23.8) —4.64 145 146.4 (12.3) —0.12 36 36.4 (6.1) —0.06 
RV 407 567.0 (23.7) —6.75 102 142.9 (12.4) —3.30 38 36.0 (5.9) 0.34 
HIV-1 347 416.6 (20.1) —3.46 89 102.1 (10.2) —1.29 34 25.0 (4.8) 1.87 
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Table 4 Z Scores for Palindromes of Various Lengths Under the M1 Model 
Length-four palindromes Length-six palindromes Length-eight palindromes 
Counts Yn (Fu) Zan Counts Yon (Fa) Zan Counts Yn (Tui) Zu 

SARS 1,144 1,242.7 (33.4) —2.96 284 327.3 (18.0) —2.41 90 86.5 (9.4) 0.37 
AIBV 1,142 1,229.8 (35.4) —2.48 320 326.9 (17.8) —0.39 91 87.0 (9.4) 0.42 
BCoV 1,360 1,476.5 (37.2) —3.13 389 390.4 (19.5) —0.07 98 103.4 (9.8) —0.55 
HCoV 1,054 1,146.9 (34.5) —2.69 287 307.6 (17.4) —1.18 82 82.7 (8.9) —0.08 
MHV 1,328 1,421.3 (37.8) —2.47 340 364.3 (18.8) —1.29 82 93.5 (9.8) —1.17 
PEDV 1,079 1,169.8 (34.5) —2.63 274 302.9 (17.5) —1.65 79 78.6 (9.1) 0.05 
TGV 1,180 1,239.5 (34.0) —1.75 306 333.2 (18.4) —1.48 85 89.8 (9.7) —0.49 
RUV 610 604.3 (24.5) 0.23 167 172.5 (13.8) —0.40 68 49.2 (6.9) 2.72 
EAV 479 529.6 (22.5) 2.25 145 134.8 (11.3) 0.91 36 34.3 (5.7) 0.30 
RV 407 415.2 (19.1) —0.43 102 109.8 (10.4) —0.75 38 28.9 (5.3) 1.71 
HIV-1 347 358.3 (18.7) —0.60 89 91.0 (9.6) —0.21 34 23.1 (4.5) 2.42 


of MO. M1 is therefore more conservative in declaring 
the palindrome counts in a genome to be significantly 
different from those in random sequences. We shall 
base our discussion of the results on M1 whenever 
possible. 

The counts of palindromes of length at least four 
in each coronavirus analyzed are significantly lower 
than expected (see Table 2). As the palindrome length 
increases to six and above, the underrepresentation 
of palindromes no longer holds across the family 
(theoretical z scores under M1 range from —1.66 
to 0.46). This suggests that there is a family-wide 
avoidance of palindromes of exact length four in the 
coronaviruses, which is confirmed by the empirical z 
scores for exact-length palindromes in Tables 3 and 4. 
With this knowledge, a thorough examination of the 
relative abundance of individual length-four palin- 
dromes, conditional on the total length-four palin- 
drome count is called for. We are in the process of 
setting up such a study. 

Although the underrepresentation of length-four 
palindromes is observed for all of the coronaviruses 
in our data set that include members from all three 
antigenic groups (Marra et al. 2003), this underrepre- 
sentation is not universally true in all RNA viruses, 
as demonstrated by the other RNA viruses outside 
the coronavirus family. While it is conceivable that 
palindrome underrepresentation is just a characteris- 
tic of the common ancestor of the coronaviruses, it 
is worth noting that the characteristic is preserved in 
the family despite the reputation for RNA viruses to 
be nature’s swiftest evolvers (Worobey and Holmes 
1999). So far, we cannot find any previous report 
of underrepresentation of short palindromes in RNA 
viruses with eukaryotic hosts. However, avoidance of 
short palindromes in some bacterial and phage DNA 
genomes has been reported in several studies (Karlin 
et al. 1992; Merkl and Fritz 1996; Rocha et al. 1998, 
2001, among others). The phenomenon is generally 
explained in relation to the defense mechanisms of the 


bacterial and phage genomes, protecting themselves 
against being destroyed by restriction enzymes capa- 
ble of cutting up DNA molecules at certain palin- 
dromic sites. It will be interesting to investigate 
whether there is any possible interaction of the short 
palindromes in the coronavirus genomes with the 
immune system of the host cells that might have 
detrimental effects on the survival of the virus. 

Length-six palindromes are found significantly 
underrepresented only in SARS but not in the other 
six coronaviruses (see Table 4). Would this avoidance 
of length-six palindromes in the SARS genome offer 
a protective effect on the virus, making it compara- 
tively more difficult to be destroyed and contributing 
to the rapid spread and the severity of the disease? 
This will be an interesting point to observe as we seek 
to learn more about the SARS virus. 

Among all palindromes found in the seven coro- 
naviruses genomes we analyzed, the longest one 
resides in SARS. It is composed of the 22 bases 
TCTTTAACAAGCTTGITAAAGA spanning positions 25962- 
25983. Because the probability distribution of palin- 
drome lengths has not been rigorously obtained, we 
can only attempt a rough estimation, based on the 
simple MO sequence model, of observing a length-22 
palindrome in a genome with base composition like 
that of SARS. It has been demonstrated in Leung 
et al. (2002) that for larger values of L (say >5), we 
may approximate the counts of palindromes at or 
above length 2L by a Poisson random variable with 
parameter A equal to the expected count. We therefore 
have P[maximal palindrome length > 22] = P[X,, > 1], 
which can be approximated by the corresponding 
Poisson probability with Ay, = E(X,,) = 0.01008 by 
Proposition 3. This Poisson probability is equal to 
1—e-44, about 1%. 

Knowing that this long palindrome is quite unlikely 
to occur by chance, one would logically ask the ques- 
tion of whether it plays any particular functional role. 
According to the classification of open reading frames 
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(ORFs) encoding potential nonstructural proteins of 
the SARS virus (Rota et al. 2003, Table 1), this palin- 
drome occurs in the overlapping region of the two 
ORFs designated X1 and X2. Due to the location 
of this palindrome, it is tempting to speculate that 
it might be involved in some secondary structures 
serving similar purposes like those of a pseudoknot, 
which is typically found at frame-shift locations in 
overlapping coding sequences (Giedroc et al. 2000). 
One would have to perform a detailed secondary 
structure prediction on this part of the SARS and 
other coronavirus genomes before further suggestions 
can be made. The methods and tools used by Qin 
et al. (2003) to predict the secondary structure in 
another part of the SARS virus genome (around the 
packaging-signal sequence) are likely to be applicable 
here as well. 

Another feature unique to SARS is the occurrence 
of two repeating length-12 palindromes TTATAATTATAA 
spanning positions 22712-22723 and 22796-22807, 
all within 100 bases of the genome in the coding 
sequence of the surface-spike glycoprotein, which is 
important for virus entry and virus-receptor interac- 
tions (Yu et al. 2003). Both copies begin on the third 
position of a codon. Three amino acids Tyr-Asn-Tyr 
are coded by the second through tenth bases of 
the palindrome. No such repeating palindromes are 
observed in the corresponding glycoprotein-coding 
sequences for any of the other six coronaviruses. Prob- 
abilistic assessment of close repeating palindromes 
occurring in random sequences has yet to be formu- 
lated mathematically or estimated by simulation. (The 
method of Robin and Daudin 1999 can be used to 
assess the probability that a given palindrome repeats 
itself in close proximity.) If such an observation is 
found to be unlikely to occur by chance, then these 
repeating palindromes might be tested for potential 
regulatory functions. Large palindromes present in 
single-stranded RNA have the inherent ability to form 
double-stranded stem structures through the forma- 
tion of intramolecular base pairs; thus, it is possible 
that these sequences form secondary RNA structures 
in the genomic RNA and in one or more subgenomic 
RNAs of the SARS virus. In many of the single- 
stranded RNA viruses, stem structures play important 
regulatory roles in genome replication or gene expres- 
sion. It should be possible to investigate potential 
regulatory roles of these repeated length-12 palin- 
dromes by engineering silent mutations within these 
sequences such that the encoded protein is not altered 
but the palindromes and putative secondary struc- 
tures are lost. 


5. Concluding Remarks 
While we hope that there will never be another out- 
break of SARS, we believe that detailed analysis of 


the SARS genome sequence can help generate useful 
information for understanding the biology of the 
coronaviruses and perhaps other RNA viruses in gen- 
eral. This first exploration about palindromes in the 
coronavirus family generates many questions to be 
investigated in greater detail mathematically, compu- 
tationally, as well as biologically. 

Closely related to palindromes is the sequence fea- 
ture of close inversion, which is a palindrome with 
its two halves separated by a short stretch of inter- 
vening nucleotides. These close inversions are well 
known to form stem-loop and other secondary struc- 
tures involved in the viral recombination and pack- 
aging process (Rowe et al. 1997, Qin et al. 2003). 
We anticipate that a set of interesting and challeng- 
ing questions in random-sequence models will again 
emerge from the analysis of close inversions. 
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